Introduction

The purpose of this study is to examine Spotify’s music library in terms of various factors, including popularity, genres, and releases. As a result, we will gain insight into the business strategy of the company.

In this report, we intend to answer the following questions:

  • How have the interests of Spotify users changed over time?
  • What is the distribution of genres across the Spotify library?
  • Does the popularity of a song correlate with its audio properties, including its loudness, danceability, and acousticness?
  • How are these characteristics distributed across Spotify’s library?

Packages Required

In this project, the following R packages were used for data analysis:

Library      Description
‘ggplot2’    An R package for data visualization.
‘dplyr’      An R package for data wrangling.
‘tidyverse’  A collection of R packages for data wrangling and data manipulation.
‘knitr’      An R package that provides an engine for dynamic report generation.
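
The code later in this report also calls datatable() and corrplot(), which come from the DT and corrplot packages, while read_csv() and drop_na() come from readr and tidyr (both attached by tidyverse). A minimal setup sketch, assuming these packages are already installed, could check for them before attaching; the names `required_pkgs` and `missing_pkgs` are illustrative only.

```r
# Packages used in this report; DT and corrplot are inferred from the
# datatable() and corrplot() calls further down (assumption).
required_pkgs <- c("ggplot2", "dplyr", "tidyverse", "knitr", "DT", "corrplot")

# Check availability up front instead of failing halfway through the report.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach everything so functions can be called without the pkg:: prefix.
  invisible(lapply(required_pkgs, library, character.only = TRUE))
}
```

Checking first keeps a missing package from stopping the report partway through rendering.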

Data Preparation

Data Source

We obtained the Spotify songs data for analysis from the rfordatascience/tidytuesday repository on GitHub (dataset of 2020-01-21).

Data Import

The Spotify data includes 32833 rows and 23 attributes, such as track_popularity, danceability, loudness, and tempo, for songs released from the late 1950s to 2019.

The following table illustrates the first five rows of raw data:

spotify_data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

kable(head(spotify_data,5))
track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms
6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754
0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600
1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616
75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093
1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052

Data Structure

Spotify data includes 32833 tracks and 23 attributes.

print(paste('The data has',dim(spotify_data)[1],'rows and',dim(spotify_data)[2],'attributes'))
## [1] "The data has 32833 rows and 23 attributes"

Data Dictionary

We can see that 10 of the variables are character variables, while the remaining 13 are numerical variables. Below is a description of the Spotify data.

# Reads the same spotify_songs.csv file as in the Data Import step (identical URL).
# datatable() below comes from the DT package.
spotify_data_dictionary <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
datatable(spotify_data_dictionary, options = list(
  autoWidth = TRUE,
  columnDefs = list(list(className = 'dt-center', targets = 3)),
  pageLength = 25,
  lengthMenu = c(5, 10, 15, 20, 25)
))
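
The split between character and numerical variables quoted above can be checked directly from the column classes. A minimal sketch on a three-column toy data frame (`toy` is a made-up stand-in for spotify_data, not the real data):

```r
# Toy stand-in for spotify_data: two character columns, one numeric.
toy <- data.frame(
  track_id         = c("a1", "b2", "c3"),
  playlist_genre   = c("pop", "rock", "edm"),
  track_popularity = c(66, 67, 70),
  stringsAsFactors = FALSE
)

# First class of each column, then a count per class.
col_classes <- vapply(toy, function(x) class(x)[1], character(1))
type_counts <- table(col_classes)
print(type_counts)
```

Run against the full data set, the same two lines would be expected to reproduce the 10 character and 13 numeric columns reported here.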

Summary Statistics

Below are summary statistics for the numerical variables of the Spotify data. In the following sections we examine the data more closely to determine whether there are any outliers or missing values.

summary(spotify_data[,c(4,12:23)])
##  track_popularity  danceability        energy              key        
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.000175   Min.   : 0.000  
##  1st Qu.: 24.00   1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000  
##  Median : 45.00   Median :0.6720   Median :0.721000   Median : 6.000  
##  Mean   : 42.48   Mean   :0.6548   Mean   :0.698619   Mean   : 5.374  
##  3rd Qu.: 62.00   3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000  
##  Max.   :100.00   Max.   :0.9830   Max.   :1.000000   Max.   :11.000  
##     loudness            mode         speechiness      acousticness   
##  Min.   :-46.448   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: -8.171   1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151  
##  Median : -6.166   Median :1.0000   Median :0.0625   Median :0.0804  
##  Mean   : -6.720   Mean   :0.5657   Mean   :0.1071   Mean   :0.1753  
##  3rd Qu.: -4.645   3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550  
##  Max.   :  1.275   Max.   :1.0000   Max.   :0.9180   Max.   :0.9940  
##  instrumentalness       liveness         valence           tempo       
##  Min.   :0.0000000   Min.   :0.0000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:0.0000000   1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96  
##  Median :0.0000161   Median :0.1270   Median :0.5120   Median :121.98  
##  Mean   :0.0847472   Mean   :0.1902   Mean   :0.5106   Mean   :120.88  
##  3rd Qu.:0.0048300   3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92  
##  Max.   :0.9940000   Max.   :0.9960   Max.   :0.9910   Max.   :239.44  
##   duration_ms    
##  Min.   :  4000  
##  1st Qu.:187819  
##  Median :216000  
##  Mean   :225800  
##  3rd Qu.:253585  
##  Max.   :517810

Data Preprocessing

Let us first examine the columns of the dataset.

glimpse(spotify_data)
## Rows: 32,833
## Columns: 23
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
## $ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
## $ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
## $ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
## $ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
## $ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
## $ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
## $ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "dance…
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
## $ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
## $ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
## $ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 16304…

Checking Missing Values

Next, we count the missing values in each column so that we can decide whether they should be dropped, retained, or imputed with mean or median values.

colSums(is.na(spotify_data))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
# Show every row that has an NA in at least one column.
spotify_data %>%
  filter(if_any(everything(), is.na))
## # A tibble: 5 × 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 69gRFGOWY9OMpFJgFol1u0 <NA>       <NA>                        0 717UG2du6utFe…
## 2 5cjecvX0CmC9gK0Laf5EMQ <NA>       <NA>                        0 3luHJEPw434tv…
## 3 5TTzhRSWQS4Yu8xTgAuq6D <NA>       <NA>                        0 3luHJEPw434tv…
## 4 3VKFip3OdAvv4OfNTgFWeQ <NA>       <NA>                        0 717UG2du6utFe…
## 5 69gRFGOWY9OMpFJgFol1u0 <NA>       <NA>                        0 717UG2du6utFe…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
## #   playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## #   playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## #   loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>

Three variables (track_name, track_artist, and track_album_name) contain missing values. All of them fall in the same five rows, which represent less than 0.1% of the data. We have therefore decided to remove the rows with NA values, as this will not materially affect our analysis.

spotify_data <- spotify_data %>% drop_na() 
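
The share of affected rows can be computed before deciding to drop them. A minimal sketch on toy data, using base R's complete.cases() rather than the tidyr call above (the data frame and variable names are illustrative):

```r
# Toy data with one incomplete row out of four.
toy <- data.frame(
  track_name   = c("Song A", NA, "Song C", "Song D"),
  track_artist = c("Artist A", NA, "Artist C", "Artist D"),
  tempo        = c(120, 98, 101, 140),
  stringsAsFactors = FALSE
)

# Rows with at least one NA, as a count and as a percentage of all rows.
incomplete  <- !complete.cases(toy)
n_missing   <- sum(incomplete)
pct_missing <- 100 * n_missing / nrow(toy)

# Keep only complete rows (same effect as drop_na() with no arguments).
toy_clean <- toy[!incomplete, ]
```

For the figures reported in this section, 5 of 32833 rows is about 0.015%.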

Identifying Duplicate Values

In the data dictionary, “track_id” is described as an identifier for the songs in the data set, so we checked this column for duplicates. Keeping one row per “track_id” removed 4476 duplicate rows (4481 rows in total when the five rows with missing values are included), leaving cleaned data with 28352 rows and 23 attributes.

# Keep the first row for each track_id and report the new dimensions.
spotify_data <- spotify_data %>% distinct(track_id, .keep_all = TRUE)
dim(spotify_data)
## [1] 28352    23
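
Before dropping anything, it can help to count how many rows repeat an identifier. A minimal base R sketch on toy data (names are illustrative), which keeps the first occurrence in the same way as distinct(track_id, .keep_all = TRUE):

```r
# Toy data: track "a1" and track "b2" each appear in two playlists.
toy <- data.frame(
  track_id      = c("a1", "b2", "a1", "c3", "b2"),
  playlist_name = c("P1", "P1", "P2", "P2", "P3"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat after the first occurrence of an id.
is_dup <- duplicated(toy$track_id)
n_dup  <- sum(is_dup)

# Keep the first row per track_id.
toy_unique <- toy[!is_dup, ]
```

Because only the first row per track is kept, that row decides which playlist and genre the track retains.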

Creating New Variable(s)

The “duration_ms” column records track length in milliseconds, which is awkward to read for song durations. Therefore, we created a new variable called “duration_m”, which records the duration of the songs in minutes, and dropped the original column.

spotify_data <- spotify_data %>% mutate(duration_m = duration_ms/60000)
spotify_data <- select(spotify_data, -duration_ms)
colnames(spotify_data)
##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_m"

Upon reviewing the data, we find that track popularity varies with both time and genre, and we evaluate this relationship further in the exploratory data analysis section. For that purpose we extract the year from the “track_album_release_date” column into a new variable, “track_album_release_year”, so that trends can be analyzed year by year rather than day by day. We also bin “track_popularity” into 20-point intervals, stored in “track_popularity_tag”.
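
One point worth checking when extracting the year is that as.Date() yields NA for strings that do not hold a full year-month-day date. The sketch below assumes, without having verified it against this data set, that some release dates could be stored as a year or a year-month only, and compares the as.Date() route with simply taking the first four characters. The vector `release_date` is made up for illustration.

```r
# Hypothetical mix of full and partial release dates.
release_date <- c("2019-06-14", "2012", "1998-07")

# as.Date() route: strings without a full date are expected to come back as NA.
parsed_year <- as.numeric(format(as.Date(release_date), "%Y"))

# Taking the leading four characters keeps the year for every string.
release_year <- as.numeric(substr(release_date, 1, 4))
```

If sum(is.na(track_album_release_year)) is zero after running the code below, no such partial dates are present and both routes give the same result.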

spotify_data$track_album_release_date <- as.Date(spotify_data$track_album_release_date)
spotify_data$track_album_release_year <- as.numeric(format(spotify_data$track_album_release_date, "%Y"))
track_popularity_uniques <- spotify_data %>% distinct(track_popularity) %>% select(track_popularity)
tags <- c("[0-20]","(20-40]", "(40-60]", "(60-80]", "(80-100]", "(100+]")

spotify_data_binned <- spotify_data %>% 
  mutate(track_popularity_tag = case_when(
    track_popularity <= 20 ~ tags[1],
    track_popularity > 20 & track_popularity <= 40 ~ tags[2],
    track_popularity > 40 & track_popularity <= 60 ~ tags[3],
    track_popularity > 60 & track_popularity <= 80 ~ tags[4],
    track_popularity > 80 & track_popularity <= 100 ~ tags[5],
    track_popularity > 100 ~ tags[6]
    ))
spotify_data_binned %>% distinct(track_popularity_tag)
## # A tibble: 5 × 1
##   track_popularity_tag
##   <chr>               
## 1 (60-80]             
## 2 (40-60]             
## 3 (20-40]             
## 4 [0-20]              
## 5 (80-100]
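
The same binning can also be written with base R's cut(), which returns a factor whose levels follow the bin order. This is an alternative sketch on a toy vector, not the code used in this report:

```r
# Toy popularity scores, including values on the bin edges.
popularity <- c(0, 20, 21, 40, 66, 80, 81, 100)

# Breaks every 20 points; include.lowest = TRUE puts 0 in the first bin,
# and right = TRUE (the default) closes each interval on the right.
popularity_tag <- cut(
  popularity,
  breaks = seq(0, 100, by = 20),
  labels = c("[0-20]", "(20-40]", "(40-60]", "(60-80]", "(80-100]"),
  include.lowest = TRUE
)
table(popularity_tag)
```

Since track_popularity ranges from 0 to 100 in the summary statistics, five bins cover every value. A factor also keeps the bins in numeric order when plotted, whereas character tags sort alphabetically.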

Checking Outlier(s)

To determine if outliers in the dataset should be removed, retained or imputed, we plot the boxplots below for each of the numerical attributes.

# Reshape the audio features (danceability through tempo, columns 12 to 22)
# to long format so that each feature gets its own panel.
spotify_pivot <- spotify_data_binned %>%
  select(danceability:tempo) %>%
  pivot_longer(cols = danceability:tempo, names_to = "Var", values_to = "val")
ggplot(spotify_pivot, aes(y = val, fill = Var)) +
  geom_boxplot(show.legend = FALSE, width = .6, position = "dodge") +
  coord_flip() +
  facet_wrap(vars(Var), ncol = 3, scales = "free") +
  scale_fill_grey()

Without domain expertise we cannot justify removing these outliers at this point, so we retain them. They may carry information about what makes a track popular with listeners, which could later be used to understand how popularity can be increased.
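
To put a number on what the boxplots show, the points beyond the whiskers can be counted with the usual 1.5 × IQR rule. A minimal sketch on made-up loudness values (the vector and variable names are illustrative):

```r
# Toy loudness values in dB; -46 mimics an extreme low value.
loudness <- c(-46, -9, -8, -7, -6, -6, -5, -5, -4, -3)

# Points further than 1.5 * IQR from the quartiles are flagged.
q     <- quantile(loudness, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- loudness < q[1] - fence | loudness > q[2] + fence

sum(is_outlier)       # number of flagged points
loudness[is_outlier]  # their values
```

Counting flagged points per attribute gives a quick sense of how much data a removal rule would discard.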

Histograms

The following histograms illustrate the skewness of the data.

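# bins = 30 is the geom_histogram() default, stated explicitly to silence the bin-width message.
ggplot(spotify_pivot, aes(x = val, fill = Var)) +
  geom_histogram(show.legend = FALSE, position = "dodge", bins = 30) +
  facet_wrap(vars(Var), ncol = 3, scales = "free") +
  scale_fill_grey()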
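
Skewness can also be summarized as a single number per attribute. The sketch below uses a simple moment-based definition; the helper name `skewness` and the toy vectors are illustrative, and packages such as e1071 provide their own variants.

```r
# Moment-based skewness: mean cubed deviation over the
# (population) standard deviation cubed.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right_skewed <- c(0, 0, 0, 0.01, 0.02, 0.05, 0.9)  # mostly near zero, one large value
symmetric    <- c(-2, -1, 0, 1, 2)

skewness(right_skewed)  # positive: long right tail
skewness(symmetric)     # zero: balanced tails
```

A positive value indicates a long right tail, a negative value a long left tail.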

Data Preview

The following is the final preview of the cleaned data.

spotify_data_cleaned <- spotify_data_binned
datatable(head(spotify_data_cleaned, 25), options = list(
  scrollCollapse = TRUE,scrollX = TRUE,
  autoWidth = TRUE,
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20, 25)
))

Proposed Exploratory Data Analysis

As a first step, we explore user behavior patterns in the data. We then propose statistical models (for example, regression models) to answer our research questions and gain further insight.

Correlation Analysis

First, we examine the pairwise correlations between the song attributes to see which variables are strongly related:

corr_data <- select(spotify_data_cleaned, track_popularity, danceability, energy, loudness,
                    speechiness, acousticness, instrumentalness, liveness, valence, tempo)
# corrplot() comes from the corrplot package.
corrplot(cor(corr_data), tl.col = 'black')

  • Observations
    • Energy and loudness are strongly positively correlated.
    • The correlation plot shows that track popularity is not strongly correlated with any of the audio characteristics.
    • Energy and acousticness are negatively correlated, as are loudness and acousticness.
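
A correlation plot shows the strength of each relationship but not its statistical significance. For a single pair, cor.test() reports a p-value and a confidence interval. A minimal sketch on simulated data (the variables and their relationship are invented for illustration and are not drawn from the Spotify data):

```r
set.seed(42)  # reproducible toy data

# Simulated stand-ins: loudness depends on energy, popularity does not.
energy     <- runif(200)
loudness   <- -12 + 8 * energy + rnorm(200, sd = 1.5)
popularity <- runif(200, 0, 100)

# Pearson correlation matrix, the same kind of input corrplot() receives.
round(cor(cbind(energy, loudness, popularity)), 2)

# cor.test() adds a p-value and confidence interval for one pair.
res <- cor.test(energy, loudness)
res$estimate
res$p.value
```

With tens of thousands of tracks, even weak correlations tend to come out as statistically significant, so the size of the coefficient is usually the more informative quantity.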

Analysis of the release of tracks by genres

We plot the number of tracks released per year after 2005 (2006 through 2019), split by genre, and draw insights from the result.

song_years_genre_df <- spotify_data_cleaned %>%
  # Keep releases after 2005, up to and including 2019.
  filter(track_album_release_year > 2005 & track_album_release_year <= 2019) %>%
  select(track_album_release_year, playlist_genre) %>%
  group_by(track_album_release_year, playlist_genre) %>%
  summarise(songs_released = n()) %>%
  ungroup()
ggplot(song_years_genre_df, aes(x = track_album_release_year, y = songs_released)) +
  geom_line(aes(color = playlist_genre)) +
  ggtitle("The number of songs released by each genre over the years") +
  ylab("Song releases") +
  xlab("Year of release")

  • Observations
    • Of all the genres, rock has the fewest songs released each year.
    • EDM had relatively few releases until 2010; the number of EDM songs released rose sharply after 2013 and peaked in 2019.
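
The counting step behind this plot can be illustrated on a handful of rows. The following base R sketch uses made-up data and aggregate() in place of the dplyr pipeline, with the same year window as the filter above:

```r
# Toy data: one row per track.
toy <- data.frame(
  track_album_release_year = c(2005, 2006, 2006, 2018, 2019, 2019, 2019),
  playlist_genre           = c("rock", "rock", "edm", "edm", "edm", "edm", "rock"),
  stringsAsFactors = FALSE
)

# Same window as the filter above: strictly after 2005, up to and including 2019.
in_window <- toy$track_album_release_year > 2005 & toy$track_album_release_year <= 2019

# One row per year-genre pair with the number of tracks in it.
counts <- aggregate(
  songs_released ~ track_album_release_year + playlist_genre,
  data = transform(toy[in_window, ], songs_released = 1),
  FUN  = sum
)
counts
```

The 2005 row is excluded because the comparison is strict, so the plotted series starts in 2006.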

Recap

We have briefly examined user behavior patterns. Next, we will go into more detail and develop statistical models (for example, regression models) to answer our research questions.
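
As an indication of what such a regression model might look like, the sketch below fits an ordinary least squares model with lm() on simulated data. The data frame `sim`, the choice of predictors, and the coefficients used to generate it are all invented for illustration; nothing here is a result from the Spotify data.

```r
set.seed(1)  # reproducible simulation

# Simulated stand-in for the cleaned data; coefficients are invented.
n <- 300
sim <- data.frame(
  danceability = runif(n),
  loudness     = rnorm(n, mean = -7, sd = 3)
)
sim$track_popularity <- 40 + 10 * sim$danceability +
  1.5 * sim$loudness + rnorm(n, sd = 10)

# Ordinary least squares with popularity as the response.
fit <- lm(track_popularity ~ danceability + loudness, data = sim)
coef(summary(fit))
```

On the real data, the same call would take spotify_data_cleaned as its data argument, with predictors chosen in light of the correlation analysis.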